The actual content of the text mining annotations is stored in high-performance Lucene indexes. This decision was taken when the initial approach, based purely on a relational database, reached a point where the overall performance and maintainability of the database were no longer satisfactory.
As mentioned in a previous chapter, each document is represented by several n-dimensional vectors. This representation is convenient for further processing of the document. However, it is not easily stored and manipulated in a relational database. One of the problems with this representation is the amount of data that would have to be stored as database rows. We therefore adopted another approach: the document vectors are stored in a Lucene index. Each document vector consists of numeric values, so a simple "WhitespaceAnalyzer" (org.apache.lucene.analysis.WhitespaceAnalyzer) is used for storing and searching in the indexes. The performance for storing and retrieving document information remains constant within the range of 1,000,000 documents with an average of 23,000 tokens per document.
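To illustrate why a whitespace-only analyzer is sufficient for numeric vectors, the following minimal sketch approximates whitespace tokenization in plain Java. It does not call Lucene; the class WhitespaceTerms, its methods, and the sample vector are hypothetical and serve only as an illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene or ATLAS: approximates
// whitespace-only tokenization of a numeric document vector.
final class WhitespaceTerms {

    // Split the field text on runs of whitespace. Terms are kept verbatim:
    // no lowercasing, stemming, or stop-word removal is applied.
    static List<String> tokenize(String fieldText) {
        List<String> terms = new ArrayList<>();
        for (String part : fieldText.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    // Exact term match: a term is found only if it equals a whole token.
    static boolean containsTerm(String fieldText, String term) {
        return tokenize(fieldText).contains(term);
    }

    public static void main(String[] args) {
        String vector = "17 4 4 923 58"; // made-up sample values
        System.out.println(tokenize(vector));
        System.out.println(containsTerm(vector, "923"));
        System.out.println(containsTerm(vector, "92"));
    }
}
```

Because every value is a complete token, a lookup for "92" does not match the value "923"; numeric values need no further linguistic analysis.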
Using Lucene as a data source allows us to execute queries efficiently, such as:
Each document (vector) consists of two mandatory fields:
The rest of the fields are determined by the type of the document vector. For example, the vector of tokens of a document additionally contains the "tokens" and "lemmas" fields; the vector for the noun phrases contains the "nps" and "heads" fields.
Each of the additional document fields contains whitespace-separated integers. The numbers correspond to identifiers (ids) of annotations stored in the relational text mining database.
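The field format described above can be sketched as follows. This is a minimal illustration, assuming that annotation ids fit into a Java long; the class AnnotationIdFields, its methods, and the sample ids are hypothetical, and only the type-specific fields "tokens" and "lemmas" are shown.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of ATLAS: converts between annotation ids
// and the whitespace-separated text stored in a document vector field.
final class AnnotationIdFields {

    // Join the ids with single spaces, e.g. {101, 102} becomes "101 102".
    static String encode(long[] ids) {
        StringJoiner joiner = new StringJoiner(" ");
        for (long id : ids) {
            joiner.add(Long.toString(id));
        }
        return joiner.toString();
    }

    // Parse a field value back into ids; a blank value yields an empty array.
    static long[] decode(String fieldValue) {
        String trimmed = fieldValue.trim();
        if (trimmed.isEmpty()) {
            return new long[0];
        }
        String[] parts = trimmed.split("\\s+");
        long[] ids = new long[parts.length];
        for (int i = 0; i < parts.length; i++) {
            ids[i] = Long.parseLong(parts[i]);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Type-specific fields of a token vector, with made-up ids.
        Map<String, String> tokenVector = new LinkedHashMap<>();
        tokenVector.put("tokens", encode(new long[] {101, 102, 103}));
        tokenVector.put("lemmas", encode(new long[] {55, 56, 55}));

        for (Map.Entry<String, String> field : tokenVector.entrySet()) {
            long[] ids = decode(field.getValue());
            System.out.println(field.getKey() + " -> " + Arrays.toString(ids));
        }
    }
}
```

The decoded ids could then be used to look up the corresponding annotations in the relational text mining database.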
ATLAS (Applied Technology for Language-Aided CMS) is a project funded by the European Commission under the CIP ICT Policy Support Programme.